Machine Learning Project - Let's Code Academy - Gabriel Zinato Rosa

Table of Contents

Introduction


The goal of this project is to develop a study of the 'COVID.csv' dataset, a database which contains information about COVID cases. Using symptom diagnoses and other information about the patients, we'll develop a model to predict confirmed COVID cases.


The variables in the dataset represent:

We'll develop the project according to the following steps:






Data Preparation and Consistency Check

Let's import the libraries we'll use in this step of the project

Reading the data

We'll drop column "Unnamed: 0" because it's a copy of the index
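A minimal sketch of these first two steps. The real notebook reads the full 'COVID.csv'; here a tiny in-memory CSV with made-up columns stands in for it so the snippet is self-contained:

```python
import io
import pandas as pd

# Simulated excerpt of 'COVID.csv': the real file has many more columns,
# but the leading "Unnamed: 0" index copy behaves the same way.
raw = io.StringIO(
    "Unnamed: 0,sex,age,covid_res\n"
    "0,0,45,1\n"
    "1,1,37,0\n"
)
df = pd.read_csv(raw)  # same call used for the real 'COVID.csv'

# "Unnamed: 0" merely duplicates the positional index, so it carries
# no information and can be dropped.
df = df.drop(columns=["Unnamed: 0"])
print(df.columns.tolist())  # ['sex', 'age', 'covid_res']
```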

Verifying missing data, duplicates and modifying the types of data:

Let's observe the types of data we have and whether there are missing data:

There are many missing entries in many columns. However, some columns appear to be complete ("sex", "patient_type", "age", "covid_res"). We can use the data in these columns to fill in some of the columns with missing data:

Completing the column "pregnancy":

When we observe the column "pregnancy", many of the missing entries coincide with sex 0 (Man):

Therefore, filling these lines does not fill the whole column but it comes very close to filling it.

We'll fill the "pregnancy" missing values with a 0 when "sex" is Man:
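A minimal sketch of this fill, on a toy frame. The encoding sex 0 = Man is taken from the text above; the toy values themselves are made up:

```python
import numpy as np
import pandas as pd

# Toy frame: sex 0 = man, 1 = woman (encoding taken from the text above).
df = pd.DataFrame({
    "sex":       [0, 0, 1, 1],
    "pregnancy": [np.nan, np.nan, 1.0, np.nan],
})

# Men cannot be pregnant, so their missing "pregnancy" entries are safely 0.
df.loc[df["sex"] == 0, "pregnancy"] = df.loc[df["sex"] == 0, "pregnancy"].fillna(0)

# Only the woman with unknown pregnancy status remains missing.
print(df["pregnancy"].isna().sum())
```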

Filling the columns "intubed" and "icu":

The columns "intubed" and "icu" were not filled for patients who were not hospitalized:

Therefore, we'll fill these values with a 0 (the patient was not intubed or sent to the ICU):
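A sketch of the same fill, assuming (this is an assumption to check against the real data dictionary) that "patient_type" 0 encodes a non-hospitalized patient:

```python
import numpy as np
import pandas as pd

# Assumed encoding: patient_type 0 = outpatient / not hospitalized,
# 1 = hospitalized. Verify against the real data dictionary.
df = pd.DataFrame({
    "patient_type": [0, 0, 1, 1],
    "intubed":      [np.nan, np.nan, 1.0, 0.0],
    "icu":          [np.nan, np.nan, 0.0, 1.0],
})

# Outpatients were never intubed or sent to the ICU, so fill with 0.
mask = df["patient_type"] == 0
df.loc[mask, ["intubed", "icu"]] = df.loc[mask, ["intubed", "icu"]].fillna(0)
```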

About the column "contact_other_covid":

There doesn't seem to be a good relation between the missing data in this column and the data in other columns that would allow us to complete this column the way we completed the other ones.

Considering the nature of the data in this column (the mere opinion of the patient?) and the fact that it would be a lot of missing data, we've decided to work without this column rather than create uncertain data to fill it.

We have these remaining missing entries:

Dropping the remaining missing entries

Verifying duplicated data:

Although duplicated rows make up the largest part of our dataset, I've decided to drop them to avoid creating problems for our model.
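Both cleanup steps, sketched on a toy frame (the data is made up, but with binary features exact duplicates are common, as described above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sex":       [0, 0, 1, np.nan, 0],
    "covid_res": [1, 1, 0, 1, 1],
})

df = df.dropna()           # drop rows that still have missing entries
before = len(df)
df = df.drop_duplicates()  # binary features make exact duplicates very common
print(before - len(df), "duplicate rows removed")
```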

Modifying dtype to spare memory and processing time

Since we only have numeric data, and they're all integers, we can avoid floating point numbers.

Observing the minimum and maximum values of the data in all columns, int8 can store these numbers.
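The downcast itself is a one-liner; a sketch on made-up columns:

```python
import pandas as pd

df = pd.DataFrame({"sex": [0, 1], "age": [45, 103], "covid_res": [1, 0]})

# Every value fits in int8's range (-128..127), so the cast is lossless
# and cuts memory use roughly 8x versus the default int64.
df = df.astype("int8")
print(df.dtypes)
```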

Exploratory Data Analysis:

We'll start with Descriptive Statistics to analyse the basic features of the data:

Let's check the distribution of the target variable:

The target variable is almost perfectly balanced. We can use Undersampling, Oversampling and SMOTE to see whether the classification results vary, but they probably won't change much.

Let's group the data with the values of the target variable to see how each feature behaves:

The comparison above shows that the mean value of most features is higher for patients whose COVID test came back negative ("covid_res" = 0). Since all features except "age" are binary, a higher group mean simply means that group has a larger share of ones. For "age", the pattern reverses: patients who tested negative were younger, on average, than patients who tested positive.
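A small sketch of this grouping, on made-up data with one binary symptom and age:

```python
import pandas as pd

# Toy data with the same shape of question: a binary symptom plus age,
# grouped by the target "covid_res".
df = pd.DataFrame({
    "fever":     [1, 1, 0, 0, 1, 0],
    "age":       [30, 40, 60, 55, 35, 50],
    "covid_res": [0, 0, 0, 1, 1, 1],
})

group_means = df.groupby("covid_res").mean()
# For a binary feature, the group mean IS the share of ones in that group,
# which is why comparing means across the two classes is meaningful.
print(group_means)
```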

Let's plot a correlation matrix and see if it shows any variables with a high correlation among them:
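Besides the heatmap, the strongly correlated pairs can be listed numerically. A sketch on synthetic data with one deliberately redundant column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a":   rng.integers(0, 2, 200),
    "age": rng.integers(0, 100, 200),
})
df["b"] = df["a"]  # deliberately redundant copy -> correlation 1.0

corr = df.corr()

# Keep only the upper triangle so each pair appears once, then filter.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high = upper.stack()
high = high[high.abs() > 0.9]
print(high)
```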

Outlier detection

Let's check the distribution of the "age" feature, since it's the only feature which has values outside the range of zero and one.

There are a lot of people with age zero (probably infants) and there are some people with age over 100. Most people seem to be between 40 and 80 years old. We'll plot a boxplot to check whether we should consider some of the patients as outliers.

We'll filter the entries above the upper fence (111 years):

Although only a small number of entries are affected, we'll remove them. If we were building a pipeline to process new, unseen data added to our dataset, we would want the final dataset to be free of outliers as well.
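The boxplot's upper fence is Tukey's rule, Q3 + 1.5·IQR. A sketch on synthetic ages (the 111-year fence above comes from the real data; this toy distribution produces its own fence):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic ages plus two deliberately extreme entries.
ages = np.concatenate([rng.normal(50, 15, 1000).clip(0, 100), [115.0, 120.0]])
df = pd.DataFrame({"age": ages})

q1, q3 = df["age"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)  # Tukey's rule, the fence drawn by the boxplot

outliers = df[df["age"] > upper_fence]
df = df[df["age"] <= upper_fence]
print(len(outliers), "entries above the fence removed")
```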

Summarizing our observations so far:

Classification Model

Importing the libraries used in this step:

Separating the data between X (features) and y (target):

As seen before, the target feature is well balanced:

We'll split the dataset into training data and test data and use the training data to pick the best model. We'll use the stratify parameter, even though the target is almost perfectly balanced.
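A sketch of that split on synthetic features and target:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame({"f1": rng.integers(0, 2, 200), "age": rng.integers(0, 100, 200)})
y = pd.Series(rng.integers(0, 2, 200), name="covid_res")

# stratify=y keeps the class proportions identical in train and test,
# cheap insurance even for an almost balanced target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_train.mean().round(2), y_test.mean().round(2))  # near-identical class shares
```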

Let's define a function to test the models and return a few metrics:
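A minimal version of such a function, exercised on synthetic data with two of the candidate models (the metric names match the comparison discussed later; the data and model list here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Fit one model and return the metrics used to compare candidates."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "accuracy":  accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall":    recall_score(y_test, pred),  # the metric this project favors
        "f1":        f1_score(y_test, pred),
    }

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

results = {}
for name, model in [("LogReg", LogisticRegression(max_iter=1000)),
                    ("Tree", DecisionTreeClassifier(random_state=0))]:
    results[name] = evaluate_model(model, X_train, X_test, y_train, y_test)
    print(name, results[name])
```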

List with the models we'll test and the random seed

Running the function to test the models:

Let's apply Normalization to see which models benefit from this technique. Although tree-based models should, in theory, be insensitive to feature scaling, some models may still benefit a little.
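One way to apply scaling without leaking test-set statistics is a Pipeline; a sketch with StandardScaler (the notebook's exact scaler may differ):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

# The Pipeline fits the scaler on the training data only, so no test
# information leaks into the scaling statistics.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(round(model.score(X_test, y_test), 3))
```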

Testing the models once again:

About the evaluation metrics obtained and which model we'll work with:

Considering all this, we'll work with the model CatBoost, which obtained the best Recall.

It's interesting to note that the Decision Tree and Random Forest models performed very badly (with metrics below 0.5, i.e. worse than simply guessing the labels at random).

Normalization affected only the Logistic Regression, Decision Tree, Random Forest and CatBoost models, and the changes in the evaluation metrics were marginal. Still, it makes more sense to scale the data.

Optimization of the CatBoost classifier:

Undersampling, Oversampling and SMOTE

Even though the target variable is already well balanced, let's see whether the model's performance improves when we use Undersampling, Oversampling or SMOTE.
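The imbalanced-learn package provides these techniques ready-made (RandomUnderSampler, RandomOverSampler, SMOTE). The core idea of the first two can be sketched with plain pandas on a mildly imbalanced toy target:

```python
import pandas as pd

df = pd.DataFrame({"f": range(10),
                   "covid_res": [0] * 6 + [1] * 4})  # toy target: 6 vs 4

counts = df["covid_res"].value_counts()

# Undersampling: shrink the majority class down to the minority size.
under = (df.groupby("covid_res", group_keys=False)
           .apply(lambda g: g.sample(counts.min(), random_state=0)))

# Oversampling: resample the minority class (with replacement) up to the majority size.
over = (df.groupby("covid_res", group_keys=False)
          .apply(lambda g: g.sample(counts.max(), replace=True, random_state=0)))

print(under["covid_res"].value_counts().tolist(),
      over["covid_res"].value_counts().tolist())
```

SMOTE differs from both: instead of copying rows, it interpolates new synthetic minority samples between existing neighbors.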

Baseline model:

Undersampling

Oversampling

SMOTE

Comparison between the different methods:

As expected, there were no big changes in the evaluation metrics, because the data was already well balanced with respect to the target variable. Let's use the untreated data, since the Recall for class 1 is the best in that case.

Cross Validation and Hyperparameter Optimization

Let's try to optimize the results of the CatBoost classifier through Cross Validation (RepeatedStratifiedKFold) and Hyperparameter Optimization (Optuna).

We'll define an objective function and the hyperparameters Optuna will try to maximize. We'll also use Cross Validation.
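A simplified sketch of the same idea, maximizing cross-validated recall over sampled hyperparameters with RepeatedStratifiedKFold. To keep it self-contained, a plain random search stands in for Optuna's sampler and scikit-learn's GradientBoostingClassifier stands in for CatBoost; the two tuned knobs mirror the real study:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=0)
rng = np.random.default_rng(0)

def objective(learning_rate, n_estimators):
    """Score one hyperparameter combination by mean cross-validated recall."""
    model = GradientBoostingClassifier(learning_rate=learning_rate,
                                       n_estimators=n_estimators,
                                       random_state=0)
    return cross_val_score(model, X, y, cv=cv, scoring="recall").mean()

# Random search over the same knobs Optuna would sample:
trials = [(10 ** rng.uniform(-2, 0), int(rng.integers(50, 200))) for _ in range(5)]
best = max(trials, key=lambda t: objective(*t))
print("best (learning_rate, n_estimators):", best)
```

Optuna replaces the `trials` list with `trial.suggest_float` / `trial.suggest_int` calls inside the objective and a smarter sampler that focuses on promising regions.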

Let's see what the best results were:

Visualizations with Optuna's study data

Excluding eval_metric and early_stopping_rounds (which were not being tuned), the hyperparameter with the most weight on the result is the learning rate, followed closely by the number of estimators.

Now we can use these hyperparameters with our model

Confusion matrix with the final results of our model:
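A minimal example of how the matrix is built and read, on made-up predictions:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]

# Rows = true class, columns = predicted class:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[3 1]
           #  [1 3]]
```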

Evaluation metrics for our model:

ROC Curve
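The ROC curve needs predicted probabilities rather than hard labels. A sketch on synthetic data (Logistic Regression stands in for the final model here):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # ROC needs scores, not hard labels

fpr, tpr, _ = roc_curve(y_test, proba)  # the points the ROC plot draws
auc = roc_auc_score(y_test, proba)
print(round(auc, 3))
```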

Conclusions

We tested different models and optimization techniques in order to obtain the best possible result (especially the best Recall) with the data available for the problem at hand. Still, the result is far from satisfactory, since the evaluation metrics remain low, and it would not be a good idea to use this model in production. Even so, it was interesting to practice the techniques used.

Some considerations:

Recall does increase slightly, but at the cost of Precision (fewer False Negatives, more False Positives).